Probability concepts

Example: We want to make statements about winter days and snow days. This means that there are 4 possible event types:

\[ \begin{alignat}{2} \text{no winter} &\; \cap \; \text{no snow} \\ \text{winter} &\; \cap \; \text{no snow} \\ \text{no winter} &\; \cap \; \text{snow} \\ \text{winter} &\; \cap \; \text{snow} \end{alignat} \] We can visualize the probability that any given day falls into one of these for classes as follows.

Marginal probabilities are just the probabilities of event, independent of anything else

The joint probability can be visualized as the intersection of bot event types winter and snow:

Finally, the conditional probability fixes one event type to a specific value, and asks “Given we fixed this first event type, what is the probability of the second even type.” We are focusing only on this subset off all possible event types:

And the conditional probability is then just the joint probability relative to (divided by) the probability of the first, fixed event type.

If we use simulations to calculate these probabilities, as illustrated in the next figure, the problem is reduced to simple counts:

\[ P(snow|winter) = \frac{\text{number of snowy winter days}}{\text{number of winter days}} \]

Here is the whole thing with A and B instead of snow and winter.

\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} = \frac{\overset{\color{red}{\text{joint probability}}}{P(A, B)}}{\overset{\color{blue}{\text{marginal probability}}}{P(B)}} \] and by multiplying with \(P(B)\) on both sides, we get first

\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} = \overset{\color{red}{\text{joint probability}}}{P(A,B)} \]

which is the same as

\[ \overset{\color{red}{\text{joint probability}}}{P(A,B)} = \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} \]

This is the general product rule (or chain rule) that connects conditional probabilities with joint probabilities.

Deriving Bayes rule

\[ P(A,B) = \color{blue}{P(A|B) \cdot P(B)} \\ P(A,B) = \color{red}{P(B|A) \cdot P(A)} \]

Because both equations have \(P(A,B)\) on the left hand side, we can also write

\[ \color{blue}{P(A|B) \cdot P(B)} = \color{red}{P(B|A) \cdot P(A)} \]

If we want to know what \(P(A|B)\) is, we now have to divide on both sides with \(P(B)\), which gives us

\[ \color{blue}{P(A|B)} = \frac{\color{red}{P(B|A) \cdot P(A)}}{\color{blue}{P(B)}} \]

This is Bayes Rule, which one uses to calculate the inverse conditional probability, i.e. when we have information about the probability of \(B\) given \(A\) (\(P(B|A)\)) and want to calculate the probability of \(A\) given \(B\) (\(P(A|B)\)).

Chapter 3 also introduces a diagnostic example. More generally we can ask the question about what is the probability of a disease given a positive test results \(P(D|T+)\).

\[ P(D|T+) = \frac{P(T+|D)\cdot P(D)}{P(T+)} \]

This tells us the that probability of disease also depends of the unconditional probability (base rate) of the disease and of positive tests.

Sometimes, Bayes rule is expressed in a more complicated manner, if one adds how to calculate \(P(T+)\), which is \(P(T+|not D)\cdot P(notD) + P(T+|D)\cdot P(D)\), however, this is not so relevant for Bayesian inference.

What are probability distribution, and where do we need them for Bayesian statistics?

Probability distributions are a special form of functions.

We use these functions to represent uncertainty.

Before we display uncertainty, let’s look at how this function looks if we are certain to land on water:

If we are certain to land on water, \(P(water)\) = 1 and all other probabilities have the value zero.

If we want to express uncertainty, our function has to allow all possible values (and forbid all impossible values) on the x-axis. If we had no information whatsoever to say something about the probability to land on water, all probabilities should get the same value.

For this function to be a probability distribution, the area under the function (the integral) must sum up to 1.

To see this, we can observe in the next plot that the area under the probability function remains constant while we go from believing weakly (left) to more strongly (right) that the probability to land on water is larger than 0.5.

To summarize, one can think of a probability distribution as a function that expresses how likely different values of a parameter (here p) are and whose area under the curve (or integral) is 1.

In Bayesian statistics, we use such distributions to express three things:

  1. Prior judgement about the probability of different parameter values before seeing the data.
  2. The probability of different parameter values given the data. This is also called the likelihood
  3. The posterior probability of different parameter values given our prior judgement and the data.

Let’s walk through a simple example. We start by describing our prior judgement, that we are slightly confident that that index finger touches water rather than land, with the beta distribution.

First the prior.

Next the likelihood. For the globe tossing example we use the binomial distribution to calculate the likelihood. Let us assume we had 4 trials and 3 successes.

Now Let us re-introduce Bayes rule, which we described above as:

\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]

If we just replace \(A\) with \(parameter\) and \(B\) with \(data\) and annotate the different terms we get

\[ \overset{\color{violet}{\text{posterior probability}}}{P(parameter|data)} = \frac{\overset{\color{red}{\text{likelihood}}}{P(data|parameter)} \cdot \overset{\color{blue}{\text{prior probability}}}{P(parameter)}}{\overset{\color{orange}{\text{evidence}}}{P(data)}} \]

This shows us that if we just multiply the likelihood with the prior, we get something that is proportional to the posterior. This is what is meant if you see this expression:

\[ posterior \propto likelihood \cdot prior \]

Let us just calculate and plot this:

posterior.likelihood = prior * likelihood

This distribution is only proportional to the posterior distribution, because the product of posterior and likelihood does not sum up to 1. We can calculate the posterior probability distribution by dividing by the sum.

s = sum(posterior.likelihood)
posterior = posterior.likelihood/s
c(1/s, sum(posterior))
## [1] 77.34548  1.00000

The following figure illustrates how we get from the the un-normalized posterior to the normalized posterior, which sums to 1, by multiplying with a constant, which is just 1/posterior.likelihood.

The next plot shows the posterior distribution together with the prior distribution and the likelihood. We are also adding a plot for the un-normalized posterior with a dotted line.

The figure shows that the posterior is a compromise between the prior distribution and the likelihood.

If we collect five times the data, we become more certain (the posterior distribution is narrower) and the influence of the prior is diminished so that the posterior will be very similar to the likelihood:

So it is not so easy to “cheat” with priors to get what one wants, provided one has collected sufficient data of course.

Ways to calculate the posterior:

Grid approximation

  • We calculate the posterior probability for a number of prior probability values.

This works if we have a limited number of parameters.

To get samples from this probability distribution, we use the posterior probabilities as weights when we sample from the prior values for which we calculate the posterior.

Quadratic (Laplace) approximation

  • We find the mode of the posterior distribution, which is also called the maximum a posteriori or MAP1
  • We use the curvature at this mode2 to approximate how the entire posterior distribution looks like.

This works if the posterior distribution has a (multivariate) normal distribution. (Which also means that it has only one mode)

To get samples we simulate from the (multivariate) normal distribution that describes the posterior.

Markov Chain Monte Carlo (MCMC)

  • A simulation based approach

Also works for non-normal and multi-modal posterior distributions, but needs more time.

MCMC directly produces samples.

Learning things from the posterior

The posterior distribution contains all our knowledge given the prior, model, and data. If possible, we should show the reader the full posterior. If the posterior distribution has a simple form, which is often the case, we can calculate certain statistics to summarize the posterior.

We walk through some concepts with the following posterior.

I prefer histograms over density plots, because density plots can distort things

Mean, median, and mode

## Loading required package: rstan
## Loading required package: StanHeaders
## 
## rstan version 2.26.6 (Stan version 2.26.1)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Loading required package: cmdstanr
## This is cmdstanr version 0.2.2
## - Online documentation and vignettes at mc-stan.org/cmdstanr
## - CmdStan path set to: /Users/guidobiele/.cmdstanr/cmdstan-2.29.1
## - Use set_cmdstan_path() to change the path
## Loading required package: parallel
## rethinking (Version 2.21)
## 
## Attaching package: 'rethinking'
## The following object is masked from 'package:rstan':
## 
##     stan
## The following object is masked from 'package:stats':
## 
##     rstudent

Qunitiles

Q05 = quantile(samples1,.05)
colored.hist(samples1, length.out = 250, upper = Q05)

Credible intervals

An x% credible interval (CI) is the central interval that contains x% of the posterior distribution

Can also be called posterior intervals. This is what some people think confidence intervals are.

CI80 = PI(samples1, prob = .8)
colored.hist(samples1, length.out = 250,
             lower = CI80[1],
             upper = CI80[2])
title(paste("Interval length = ",round(diff(CI80),3)))
abline(v = CI80, lwd = 2)
text(CI80, c(6000,6000), labels = round(CI80,2), pos = c(2,4))

Credible intervals are calculated with quintiles. For instance, if the 80% credible intervals goes from the 10% ((100-80)/2) quintile to the 90% quintile.

Highest density posterior interval

An x% Highest density posterior interval (HDPI) is the shortest interval that contains x% of the posterior distribution.

HPDI80 = HPDI(samples1, prob = .8)
colored.hist(samples1, length.out = 250,
             lower = HPDI80[1],
             upper = HPDI80[2])
title(paste("Interval length = ",round(diff(HPDI80),3)))
abline(v = HPDI80, col = "blue", lwd = 2)
abline(v = CI80, col = "blue", lwd = 2, lty = 3)
text(HPDI80, c(4000,4000), labels = round(HPDI80,2), pos = c(2,4), col = "blue")

Posterior predictions

How do we know that our model is any good?

In the workflow of Bayesian data analysis, we check “model fit” visually, by comparing if the predictions made by the model are consistent with the data.

We do this by taking parameters from the posterior distribution, and then simulate data from the model with these parameters.

If the model and parameters are good, the simulated data should look similar to the observed data.


  1. with some optimization scheme↩︎

  2. How quickly the posterior changes if one moves away from the mode↩︎